Precision-Recall versus Accuracy and the Role of Large Data Sets

Authors

  • Brendan Juba
  • Hai S. Le
Abstract

Practitioners of data mining and machine learning have long observed that the imbalance of classes in a data set has a negative impact on the quality of classifiers trained on that data. Numerous techniques for coping with such imbalances have been proposed, but nearly all lack any theoretical grounding. By contrast, the standard theoretical analysis of machine learning admits no dependence on the imbalance of classes at all. The basic theorems of statistical learning establish the number of examples needed to estimate the accuracy of a classifier as a function of its complexity (VC-dimension) and the confidence desired; the class imbalance does not enter these formulas anywhere. In this work, we consider measuring classifier performance in terms of precision and recall, measures that are widely suggested as more appropriate for the classification of imbalanced data. We observe that whenever the precision is moderately large, the worse of the precision and recall is within a small constant factor of the accuracy weighted by the class imbalance. A corollary of this observation is that the only cure for class imbalance is a larger number of examples, a finding we also illustrate empirically. We further observe that for many applications high precision is actually needed, and hence these class-imbalance-dependent measures are indeed more relevant than the accuracy.
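The relationship sketched in the abstract can be checked numerically from the standard definitions. The following is a minimal illustration, not code or notation from the paper: the names `Confusion`, `metrics`, and `error_from_precision_recall` are hypothetical, and the identity used is derived here from the textbook definitions of precision, recall, and accuracy. Writing the base rate of the positive class as mu, the error rate equals mu * ((1 - recall) + recall * (1 - precision) / precision). Assuming precision is at least 0.5, this places the error rate divided by mu between the worse of (1 - precision) and (1 - recall) and three times that value.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion-matrix counts (hypothetical helper, not from the paper)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn


def metrics(c: Confusion) -> dict:
    """Compute base rate, precision, recall, and accuracy from counts."""
    if c.tp + c.fp == 0 or c.tp + c.fn == 0 or c.tp == 0:
        raise ValueError("need at least one true positive for these ratios")
    return {
        "base_rate": (c.tp + c.fn) / c.total,  # fraction of positive examples
        "precision": c.tp / (c.tp + c.fp),
        "recall": c.tp / (c.tp + c.fn),
        "accuracy": (c.tp + c.tn) / c.total,
    }


def error_from_precision_recall(base_rate: float, precision: float, recall: float) -> float:
    """Error rate (1 - accuracy) rebuilt from base rate, precision, and recall.

    FN / N = base_rate * (1 - recall)
    FP / N = base_rate * recall * (1 - precision) / precision
    """
    return base_rate * ((1 - recall) + recall * (1 - precision) / precision)


# Illustrative counts only: 1% positives, 80% precision, 80% recall.
example = Confusion(tp=80, fp=20, fn=20, tn=9880)
m = metrics(example)
error = 1 - m["accuracy"]
worst = max(1 - m["precision"], 1 - m["recall"])

print("accuracy:", m["accuracy"])
print("worse of precision/recall error:", worst)
print("error rate divided by base rate:", error / m["base_rate"])
print("identity holds:", math.isclose(
    error, error_from_precision_recall(m["base_rate"], m["precision"], m["recall"])))
```

With these made-up counts the accuracy looks excellent while one in five positive predictions is wrong, which is the gap between accuracy and precision-recall that the abstract describes for imbalanced data.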


Similar resources

Classification and Biomarker Genes Selection for Cancer Gene Expression Data Using Random Forest

Background & objective: Microarray and next generation sequencing (NGS) data are the important sources to find helpful molecular patterns. Also, the great number of gene expression data increases the challenge of how to identify the biomarkers associated with cancer. The random forest (RF) is used to effectively analyze the problems of large-p and smal...


Evaluation of Updating Methods in Building Blocks Dataset

With the increasing use of spatial data in daily life, the production of this data from diverse information sources with different precision and scales has grown widely. Generating new data requires a great deal of time and money. Therefore, one way to reduce costs is to update the old data at different scales using new data (produced on a similar scale). One approach to updating data i...


The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution

This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as a linked or not-linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...


Effect of Spaced Repetition on Iranian EFL Learners’ Form Recall of English Single Words and Collocations

Acquiring vocabulary has always been recognized as a significant and challenging part of the language learning process. In this study, the researcher examined the extent to which form recall of target lexical items by learners of English as a foreign language (EFL) is affected by a) repetition and b) the type of target item: single words versus collocations. The treatment consisted of non-commun...




Publication date: 2017